Bartlett’s test (homogeneity of variances)#
Bartlett’s test answers a very specific question:
Do several groups look like they come from populations with the same variance?
It is most commonly used as a variance-assumption check before procedures that assume equal variances (homoscedasticity), such as:
one-way ANOVA
classical linear models with homoscedastic Gaussian noise
Because Bartlett’s test is derived under normality, it can be too eager to reject when the data are heavy‑tailed or skewed. In those cases, consider robust alternatives like Levene / Brown–Forsythe.
Learning goals#
State the null/alternative hypotheses for Bartlett’s test.
Understand the “why”: Bartlett as a likelihood‑ratio test under normality.
Implement the statistic from scratch with NumPy only.
Estimate p-values and critical values via Monte Carlo (NumPy only).
Use Plotly visuals to build intuition and interpret results.
Prerequisites#
Sample variance (unbiased, ddof=1)
Logs and basic algebra
Hypothesis testing basics: p-values, significance level α
What Bartlett’s test tests#
Assume we have k independent groups:
group i has observations \(x_{i1},\ldots,x_{in_i}\)
each group is (approximately) normal with variance \(\sigma_i^2\)
Hypotheses#
\(H_0\): \(\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2\) (all variances equal)
\(H_1\): not all \(\sigma_i^2\) are equal
The test produces a statistic \(T\) that (under \(H_0\) and normality) is approximately chi-square distributed:
\[ T \;\approx\; \chi^2_{k-1} \]
So large values of \(T\) are evidence against equal variances.
How to interpret the result#
Choose a significance level \(\alpha\) (commonly 0.05).
Compute the test statistic \(T\).
Compute the p-value: \(\mathbb{P}(\chi^2_{k-1} \ge T)\) (upper tail).
Interpretation:
Small p-value (\(p \le \alpha\)): reject \(H_0\) → at least one group variance differs.
Large p-value (\(p > \alpha\)): fail to reject \(H_0\) → data are compatible with equal variances (not proof).
In this notebook we’ll compute p-values / critical values with a Monte Carlo null distribution so we can stay NumPy-only.
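For reference, the asymptotic upper-tail probability \(\mathbb{P}(\chi^2_{k-1} \ge T)\) can itself be computed NumPy-only by integrating the chi-square density numerically. This is a minimal sketch under that idea; the helper name chi2_sf and the grid bounds are choices made here, not from any library:

```python
import math

import numpy as np


def chi2_sf(t, df, upper=200.0, n_grid=200_001):
    """P(chi2_df >= t), via trapezoidal integration of the chi-square pdf."""
    x = np.linspace(t, upper, n_grid)
    pdf = x ** (df / 2.0 - 1.0) * np.exp(-x / 2.0) / (2.0 ** (df / 2.0) * math.gamma(df / 2.0))
    # Trapezoidal rule on the grid; the tail beyond `upper` is negligible here.
    return float(np.sum((pdf[1:] + pdf[:-1]) / 2.0 * np.diff(x)))


# The 5% critical value of chi-square with df=3 is about 7.815,
# so the upper-tail probability there should be close to 0.05.
print(round(chi2_sf(7.815, df=3), 4))
```

The Monte Carlo approach below avoids even this integration step by simulating the null distribution directly.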
import math
import numpy as np
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
SEED = 7
rng = np.random.default_rng(SEED)
Intuition via toy examples#
We’ll create two synthetic scenarios with the same group means:
Equal variances (Bartlett should not reject)
One group has a larger variance (Bartlett should reject)
def simulate_groups_normal(sigmas, n_per_group, mu=0.0, rng=None):
rng = np.random.default_rng() if rng is None else rng
groups = []
for sigma in sigmas:
groups.append(rng.normal(loc=mu, scale=float(sigma), size=int(n_per_group)))
return groups
k = 4
n_per_group = 40
sigmas_equal = [1.0] * k
sigmas_unequal = [1.0, 1.0, 1.0, 2.0]
groups_equal = simulate_groups_normal(sigmas_equal, n_per_group=n_per_group, rng=rng)
groups_unequal = simulate_groups_normal(sigmas_unequal, n_per_group=n_per_group, rng=rng)
n_equal = [g.size for g in groups_equal]
n_unequal = [g.size for g in groups_unequal]
n_equal, n_unequal
([40, 40, 40, 40], [40, 40, 40, 40])
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=("$H_0$: equal variances", "$H_1$: one variance larger"),
)
for col, groups in enumerate([groups_equal, groups_unequal], start=1):
for i, g in enumerate(groups, start=1):
fig.add_trace(
go.Violin(
y=g,
name=f"G{i}",
box_visible=True,
meanline_visible=True,
showlegend=(col == 1),
),
row=1,
col=col,
)
fig.update_layout(
title="Same means, different spreads",
height=450,
width=950,
violingap=0.05,
violingroupgap=0.2,
)
fig.update_yaxes(title_text="Value", row=1, col=1)
fig.update_yaxes(title_text="Value", row=1, col=2)
fig.show()
def sample_variances(groups):
return np.array([np.var(g, ddof=1) for g in groups], dtype=float)
s2_equal = sample_variances(groups_equal)
s2_unequal = sample_variances(groups_unequal)
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=("Sample variances (equal)", "Sample variances (unequal)"),
)
labels = [f"G{i}" for i in range(1, k + 1)]
fig.add_trace(go.Bar(x=labels, y=s2_equal, showlegend=False), row=1, col=1)
fig.add_trace(go.Bar(x=labels, y=s2_unequal, showlegend=False), row=1, col=2)
fig.update_yaxes(title_text="Sample variance", row=1, col=1)
fig.update_yaxes(title_text="Sample variance", row=1, col=2)
fig.update_layout(height=380, width=950)
fig.show()
The Bartlett statistic (under normality)#
Let:
\(n_i\) = sample size of group i
\(s_i^2\) = unbiased sample variance of group i (ddof=1)
\(N = \sum_i n_i\)
\(k\) = number of groups
Define the pooled variance:
\[ s_p^2 = \frac{1}{N-k}\sum_{i=1}^{k}(n_i-1)\,s_i^2 \]
Bartlett’s statistic is:
\[ T = \frac{(N-k)\ln s_p^2 \;-\; \sum_{i=1}^{k}(n_i-1)\ln s_i^2}{C} \]
with a small correction factor:
\[ C = 1 + \frac{1}{3(k-1)}\left(\sum_{i=1}^{k}\frac{1}{n_i-1} \;-\; \frac{1}{N-k}\right) \]
Intuition
If all group variances are similar, the pooled variance explains them well → \(T\) is small.
If some \(s_i^2\) differ a lot, the log terms diverge → \(T\) grows.
(Under \(H_0\) and normality, \(T\) is approximately \(\chi^2_{k-1}\).)
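To make the formulas concrete, here is a tiny hand-sized example with two made-up groups (numbers chosen so the arithmetic is easy to follow):

```python
import math

import numpy as np

g1 = np.array([1.0, 2.0, 3.0])  # sample variance 1.0
g2 = np.array([2.0, 4.0, 6.0])  # sample variance 4.0

n = np.array([g1.size, g2.size])          # group sizes [3, 3]
k = n.size                                 # number of groups
w = n - 1                                  # per-group degrees of freedom
n_minus_k = int(n.sum() - k)               # N - k = 4
s2 = np.array([np.var(g1, ddof=1), np.var(g2, ddof=1)])  # [1.0, 4.0]

pooled = float(np.sum(w * s2) / n_minus_k)               # (2*1 + 2*4) / 4 = 2.5
numerator = n_minus_k * math.log(pooled) - float(np.sum(w * np.log(s2)))
C = 1.0 + (1.0 / (3.0 * (k - 1))) * (float(np.sum(1.0 / w)) - 1.0 / n_minus_k)  # 1.25
T = numerator / C
print(round(T, 3))  # roughly 0.714
```

With such small groups this \(T\) is nowhere near the rejection region, even though one variance is four times the other; power comes with sample size.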
Where the formula comes from (likelihood‑ratio test)#
Assuming each group is normally distributed, Bartlett’s test can be motivated as a likelihood‑ratio test.
Under \(H_1\) (variances free): each group has its own variance \(\sigma_i^2\), and the likelihood is maximized by the group sample variances.
Under \(H_0\) (common variance): all groups share one variance \(\sigma^2\), and the likelihood is maximized by the pooled variance \(s_p^2\).
The likelihood ratio \(\Lambda = L(H_0)/L(H_1)\) compares the best single-variance explanation to the best separate-variances explanation. After algebra, \(-2\log \Lambda\) is (up to constants) exactly the numerator you saw: a weighted difference between \(\log(s_p^2)\) and the \(\log(s_i^2)\).
Bartlett’s correction factor \(C\) is a small finite-sample adjustment that makes the \(\chi^2_{k-1}\) approximation more accurate.
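Concretely (a standard derivation, sketched here without the algebra): with the per-group MLEs \(\hat\sigma_i^2 = \frac{n_i-1}{n_i}\,s_i^2\) and the pooled MLE \(\hat\sigma^2 = \frac{1}{N}\sum_i n_i \hat\sigma_i^2\), the raw likelihood-ratio statistic reduces to

\[ -2\log\Lambda = N\log\hat\sigma^2 \;-\; \sum_{i=1}^{k} n_i \log\hat\sigma_i^2 . \]

Bartlett’s modification replaces the weights \(n_i\) and \(N\) with the degrees of freedom \(n_i-1\) and \(N-k\) (and the MLE variances with the unbiased \(s_i^2\) and \(s_p^2\)), then divides by \(C\); these finite-sample adjustments are what make the \(\chi^2_{k-1}\) approximation accurate.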
def bartlett_statistic(groups):
"Compute Bartlett's test statistic T (NumPy-only)."
groups = [np.asarray(g, dtype=float).ravel() for g in groups]
k = len(groups)
if k < 2:
raise ValueError("Need at least two groups.")
sample_sizes = np.array([g.size for g in groups], dtype=int)
if np.any(sample_sizes < 2):
raise ValueError("Each group must have at least 2 observations.")
df = k - 1
weights = sample_sizes - 1
n_minus_k = int(sample_sizes.sum() - k)
sample_variances = np.array([np.var(g, ddof=1) for g in groups], dtype=float)
if np.any(sample_variances <= 0):
raise ValueError(
"All groups must have positive sample variance (no constant groups)."
)
pooled_var = float(np.sum(weights * sample_variances) / n_minus_k)
numerator = n_minus_k * np.log(pooled_var) - np.sum(weights * np.log(sample_variances))
c = 1.0 + (1.0 / (3.0 * df)) * (np.sum(1.0 / weights) - (1.0 / n_minus_k))
T = float(numerator / c)
return T, int(df), pooled_var, sample_sizes, sample_variances
T_equal, df, pooled_equal, n_equal, s2_equal = bartlett_statistic(groups_equal)
T_unequal, _, pooled_unequal, n_unequal, s2_unequal = bartlett_statistic(groups_unequal)
print(f"Equal-variance example: T={T_equal:.3f}, df={df}, pooled_var={pooled_equal:.3f}")
print(f"Unequal-variance example: T={T_unequal:.3f}, df={df}, pooled_var={pooled_unequal:.3f}")
print()
print("Sample variances (equal): ", np.round(s2_equal, 3))
print("Sample variances (unequal):", np.round(s2_unequal, 3))
Equal-variance example: T=1.145, df=3, pooled_var=0.766
Unequal-variance example: T=19.689, df=3, pooled_var=1.409
Sample variances (equal): [0.674 0.834 0.675 0.881]
Sample variances (unequal): [0.7 1.049 1.2 2.689]
NumPy-only p-values via a Monte Carlo null distribution#
Under \(H_0\) (equal variances), the Bartlett statistic is scale-invariant: multiplying all data by a constant does not change \(T\).
That means the null distribution depends only on the group sizes \((n_1,\ldots,n_k)\). So we can estimate a p-value without SciPy:
Simulate many datasets under \(H_0\) using standard normal samples with the same group sizes.
Compute \(T\) for each simulated dataset.
Approximate two quantities:
p-value: fraction of simulated \(T\) values greater than or equal to the observed \(T\)
critical value at level \(\alpha\): the \((1-\alpha)\) quantile of the simulated null distribution
This is slower than the asymptotic \(\chi^2\) approximation, but it stays NumPy-only and is very explicit.
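Before relying on this, it is worth checking the scale-invariance claim numerically. The sketch below recomputes the statistic from scratch (a standalone helper mirroring bartlett_statistic above) and confirms that rescaling every observation leaves \(T\) unchanged:

```python
import numpy as np


def bartlett_T(groups):
    """Bartlett's T for a list of 1D arrays (standalone helper)."""
    n = np.array([g.size for g in groups])
    k = n.size
    w = n - 1
    n_minus_k = int(n.sum() - k)
    s2 = np.array([np.var(g, ddof=1) for g in groups])
    pooled = np.sum(w * s2) / n_minus_k
    numerator = n_minus_k * np.log(pooled) - np.sum(w * np.log(s2))
    c = 1.0 + (1.0 / (3.0 * (k - 1))) * (np.sum(1.0 / w) - 1.0 / n_minus_k)
    return float(numerator / c)


rng = np.random.default_rng(0)
groups = [rng.normal(size=n) for n in (15, 20, 25)]
scaled = [10.0 * g for g in groups]  # same data, different units
# Rescaling shifts every log-variance by the same constant, and the weights
# (n_i - 1) sum to N - k, so the shifts cancel in the numerator.
print(np.isclose(bartlett_T(groups), bartlett_T(scaled)))
```

This is exactly why simulating standard normal samples with the right group sizes suffices for the null distribution.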
def chi2_pdf(x, df):
x = np.asarray(x, dtype=float)
df = float(df)
coeff = 1.0 / (2.0 ** (df / 2.0) * math.gamma(df / 2.0))
return coeff * np.power(x, df / 2.0 - 1.0) * np.exp(-x / 2.0)
def bartlett_null_distribution(sample_sizes, n_sim=25_000, seed=0):
"Monte Carlo null distribution of Bartlett's T for given group sizes."
sample_sizes = np.asarray(sample_sizes, dtype=int)
if sample_sizes.ndim != 1 or sample_sizes.size < 2:
raise ValueError("sample_sizes must be 1D with length >= 2")
if np.any(sample_sizes < 2):
raise ValueError("All sample sizes must be >= 2")
rng = np.random.default_rng(seed)
k = int(sample_sizes.size)
df = k - 1
weights = sample_sizes - 1
n_minus_k = int(sample_sizes.sum() - k)
c = 1.0 + (1.0 / (3.0 * df)) * (np.sum(1.0 / weights) - (1.0 / n_minus_k))
weighted_s2_sum = np.zeros(n_sim, dtype=float)
weighted_log_s2_sum = np.zeros(n_sim, dtype=float)
for ni, wi in zip(sample_sizes, weights):
x = rng.standard_normal(size=(n_sim, int(ni)))
s2 = np.var(x, axis=1, ddof=1)
weighted_s2_sum += wi * s2
weighted_log_s2_sum += wi * np.log(s2)
pooled = weighted_s2_sum / n_minus_k
stats = (n_minus_k * np.log(pooled) - weighted_log_s2_sum) / c
return stats, df
def bartlett_test_numpy(groups, n_sim=25_000, seed=0):
T, df, pooled_var, sample_sizes, sample_variances = bartlett_statistic(groups)
null_stats, _ = bartlett_null_distribution(sample_sizes, n_sim=n_sim, seed=seed)
p_value_mc = float(np.mean(null_stats >= T))
return {
"stat": T,
"df": df,
"p_value_mc": p_value_mc,
"pooled_var": pooled_var,
"sample_sizes": sample_sizes,
"sample_variances": sample_variances,
"null_stats": null_stats,
}
alpha = 0.05
res_equal = bartlett_test_numpy(groups_equal, n_sim=35_000, seed=1)
res_unequal = bartlett_test_numpy(groups_unequal, n_sim=35_000, seed=1)
crit = float(np.quantile(res_equal["null_stats"], 1 - alpha))
print(f"alpha = {alpha}")
print(f"critical value (MC) ≈ {crit:.3f}\n")
print("Equal-variance example")
print(f" T = {res_equal['stat']:.3f} (df={res_equal['df']})")
print(f" p ≈ {res_equal['p_value_mc']:.4f}\n")
print("Unequal-variance example")
print(f" T = {res_unequal['stat']:.3f} (df={res_unequal['df']})")
print(f" p ≈ {res_unequal['p_value_mc']:.4f}")
alpha = 0.05
critical value (MC) ≈ 7.786
Equal-variance example
T = 1.145 (df=3)
p ≈ 0.7666
Unequal-variance example
T = 19.689 (df=3)
p ≈ 0.0001
null_stats = res_equal["null_stats"]
x_max = float(np.quantile(null_stats, 0.999))
x_grid = np.linspace(1e-6, x_max, 400)
fig = go.Figure()
fig.add_trace(
go.Histogram(
x=null_stats,
nbinsx=70,
histnorm="probability density",
name="Monte Carlo null",
opacity=0.6,
)
)
fig.add_trace(
go.Scatter(
x=x_grid,
y=chi2_pdf(x_grid, df=res_equal["df"]),
mode="lines",
name=f"χ²(df={res_equal['df']}) approx",
)
)
fig.add_vline(
x=crit,
line=dict(color="black", dash="dash"),
annotation_text=f"crit (α={alpha})",
annotation_position="top left",
)
fig.add_vline(
x=res_equal["stat"],
line=dict(color="#2ca02c"),
annotation_text="observed (equal)",
annotation_position="top right",
)
fig.add_vline(
x=res_unequal["stat"],
line=dict(color="#d62728"),
annotation_text="observed (unequal)",
annotation_position="top right",
)
fig.update_layout(
title="Bartlett statistic under H₀ (Monte Carlo) vs observed",
xaxis_title="T",
yaxis_title="Density",
bargap=0.05,
height=450,
)
fig.show()
What “rejecting” means (and what it doesn’t)#
If you reject \(H_0\), you have evidence that at least one group variance differs.
Bartlett’s test does not tell you which groups differ.
A non-rejection does not prove variances are equal; it only means the data did not provide strong evidence against equality.
A practical next step (if you reject) is to use methods that do not require equal variances, e.g. Welch’s ANOVA, robust regression, or variance-stabilizing transforms.
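As a reference for that next step, here is a minimal NumPy-only sketch of Welch’s one-way ANOVA statistic (the helper name welch_anova_F and its return convention are choices made here; double-check the formula against a trusted implementation before relying on it):

```python
import numpy as np


def welch_anova_F(groups):
    """Welch's one-way ANOVA: F-like statistic and approximate denominator df."""
    m = np.array([np.mean(g) for g in groups])          # group means
    s2 = np.array([np.var(g, ddof=1) for g in groups])  # group variances
    n = np.array([len(g) for g in groups], dtype=float)
    k = len(groups)
    w = n / s2                      # precision weights: large variance -> low weight
    W = w.sum()
    m_w = np.sum(w * m) / W         # variance-weighted grand mean
    lam = np.sum((1.0 - w / W) ** 2 / (n - 1.0))
    F = (np.sum(w * (m - m_w) ** 2) / (k - 1)) / (1.0 + 2.0 * (k - 2) / (k**2 - 1) * lam)
    df2 = (k**2 - 1) / (3.0 * lam)  # compare F against F(k-1, df2)
    return F, df2


# Identical group means -> zero numerator -> F is 0 regardless of variances.
g1 = np.array([-1.0, 0.0, 1.0])
g2 = np.array([-2.0, 0.0, 2.0])
F, df2 = welch_anova_F([g1, g2])
print(round(F, 6))
```

Unlike classical ANOVA, the weights downweight high-variance groups, so no equal-variance assumption is needed.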
Assumptions and pitfalls#
Bartlett’s test is powerful under its assumptions, but those assumptions matter:
Normality within each group (or close to it). Bartlett can reject because of non-normality even when variances are equal.
Independence within and across groups.
Outliers: a single extreme value can inflate a sample variance and trigger rejection.
If your groups are clearly skewed or heavy-tailed, prefer Levene / Brown–Forsythe (more robust), or consider transforming the data.
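To make the robust alternative concrete, here is a minimal NumPy-only sketch of the Brown–Forsythe statistic (Levene’s test with median centering): it is simply the one-way ANOVA F statistic applied to absolute deviations from each group’s median, compared under \(H_0\) to \(F(k-1, N-k)\). One of the exercises below asks you to build on this.

```python
import numpy as np


def brown_forsythe_W(groups):
    """Brown-Forsythe W: one-way ANOVA F on |x - group median|."""
    z = [np.abs(np.asarray(g, dtype=float) - np.median(g)) for g in groups]
    k = len(z)
    n = np.array([zi.size for zi in z], dtype=float)
    N = n.sum()
    zbar_i = np.array([zi.mean() for zi in z])      # per-group mean deviation
    zbar = np.concatenate(z).mean()                 # grand mean deviation
    between = np.sum(n * (zbar_i - zbar) ** 2) / (k - 1)
    within = sum(np.sum((zi - zi.mean()) ** 2) for zi in z) / (N - k)
    return float(between / within)                  # compare to F(k-1, N-k)


# Hand-checkable example: twice the spread in the second group.
W = brown_forsythe_W([np.array([1.0, 2.0, 3.0, 4.0]),
                      np.array([2.0, 4.0, 6.0, 8.0])])
print(round(W, 3))  # 2.4
```

Because it works with absolute deviations rather than log-variances, a single outlier or heavy tails distorts it far less than Bartlett’s statistic.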
def simulate_bartlett_stats(distribution_fn, sample_sizes, n_rep=5_000, seed=0):
# distribution_fn(rng, size) must return an array of the given shape
rng = np.random.default_rng(seed)
sample_sizes = np.asarray(sample_sizes, dtype=int)
k = int(sample_sizes.size)
df = k - 1
weights = sample_sizes - 1
n_minus_k = int(sample_sizes.sum() - k)
c = 1.0 + (1.0 / (3.0 * df)) * (np.sum(1.0 / weights) - (1.0 / n_minus_k))
weighted_s2_sum = np.zeros(n_rep, dtype=float)
weighted_log_s2_sum = np.zeros(n_rep, dtype=float)
for ni, wi in zip(sample_sizes, weights):
x = distribution_fn(rng, size=(n_rep, int(ni)))
s2 = np.var(x, axis=1, ddof=1)
weighted_s2_sum += wi * s2
weighted_log_s2_sum += wi * np.log(s2)
pooled = weighted_s2_sum / n_minus_k
stats = (n_minus_k * np.log(pooled) - weighted_log_s2_sum) / c
return stats
sample_sizes = np.array([20, 20, 20])
alpha = 0.05
# Critical value calibrated under normality
null_stats_normal, df = bartlett_null_distribution(sample_sizes, n_sim=60_000, seed=0)
crit = float(np.quantile(null_stats_normal, 1 - alpha))
# Three equal-variance worlds: normal vs heavy-tailed vs skewed
stats_normal = simulate_bartlett_stats(
lambda rng, size: rng.standard_normal(size),
sample_sizes,
n_rep=6_000,
seed=1,
)
# t(df=3) scaled to variance ~ 1 (Var = df/(df-2))
scale_t3 = math.sqrt((3 - 2) / 3)
stats_t3 = simulate_bartlett_stats(
lambda rng, size: rng.standard_t(df=3, size=size) * scale_t3,
sample_sizes,
n_rep=6_000,
seed=2,
)
# Lognormal scaled to variance ~ 1
sigma_logn = 1.0
var_logn = math.exp(2 * sigma_logn**2) - math.exp(sigma_logn**2)
stats_logn = simulate_bartlett_stats(
lambda rng, size: rng.lognormal(mean=0.0, sigma=sigma_logn, size=size) / math.sqrt(var_logn),
sample_sizes,
n_rep=6_000,
seed=3,
)
empirical_type1 = {
"Normal (assumption OK)": float(np.mean(stats_normal >= crit)),
"t(df=3) heavy-tailed": float(np.mean(stats_t3 >= crit)),
"Lognormal (skewed)": float(np.mean(stats_logn >= crit)),
}
empirical_type1
{'Normal (assumption OK)': 0.052,
't(df=3) heavy-tailed': 0.4086666666666667,
'Lognormal (skewed)': 0.6696666666666666}
labels = list(empirical_type1.keys())
values = list(empirical_type1.values())
fig = go.Figure(
go.Bar(
x=labels,
y=values,
text=[f"{v:.3f}" for v in values],
textposition="outside",
)
)
fig.add_hline(y=alpha, line=dict(color="black", dash="dash"))
fig.update_layout(
title=f"Empirical Type I error at α={alpha} (true variances equal)",
xaxis_title="Data distribution",
yaxis_title="P(reject H₀)",
yaxis=dict(range=[0, max(values) + 0.05]),
height=420,
)
fig.show()
Power: when the variances truly differ#
“Power” here means: if one group’s variance is larger, how often does Bartlett’s test reject \(H_0\)?
We’ll simulate normal data with 4 groups of equal size and vary the ratio \(r = \sigma_4/\sigma\).
def simulate_bartlett_stats_normal(sample_sizes, sigmas, n_rep=4_000, seed=0):
rng = np.random.default_rng(seed)
sample_sizes = np.asarray(sample_sizes, dtype=int)
sigmas = np.asarray(sigmas, dtype=float)
if sample_sizes.size != sigmas.size:
raise ValueError("sample_sizes and sigmas must have same length")
k = int(sample_sizes.size)
df = k - 1
weights = sample_sizes - 1
n_minus_k = int(sample_sizes.sum() - k)
c = 1.0 + (1.0 / (3.0 * df)) * (np.sum(1.0 / weights) - (1.0 / n_minus_k))
weighted_s2_sum = np.zeros(n_rep, dtype=float)
weighted_log_s2_sum = np.zeros(n_rep, dtype=float)
for ni, wi, sigma in zip(sample_sizes, weights, sigmas):
x = rng.normal(loc=0.0, scale=float(sigma), size=(n_rep, int(ni)))
s2 = np.var(x, axis=1, ddof=1)
weighted_s2_sum += wi * s2
weighted_log_s2_sum += wi * np.log(s2)
pooled = weighted_s2_sum / n_minus_k
stats = (n_minus_k * np.log(pooled) - weighted_log_s2_sum) / c
return stats
sample_sizes = np.array([25, 25, 25, 25])
alpha = 0.05
null_stats, df = bartlett_null_distribution(sample_sizes, n_sim=60_000, seed=10)
crit = float(np.quantile(null_stats, 1 - alpha))
ratios = np.array([1.0, 1.1, 1.25, 1.5, 2.0, 3.0])
power = []
for r in ratios:
sigmas = np.array([1.0, 1.0, 1.0, r])
stats = simulate_bartlett_stats_normal(sample_sizes, sigmas, n_rep=5_000, seed=int(100 * r) + 1)
power.append(float(np.mean(stats >= crit)))
power
[0.05, 0.066, 0.1876, 0.5396, 0.949, 1.0]
fig = go.Figure()
fig.add_trace(go.Scatter(x=ratios, y=power, mode="lines+markers", name="power"))
fig.add_hline(y=alpha, line=dict(color="black", dash="dot"), annotation_text="α", annotation_position="bottom right")
fig.update_layout(
title="Power vs variance ratio (normal data)",
xaxis_title=r"Variance ratio r = σ₄ / σ",
yaxis_title="P(reject H₀)",
yaxis=dict(range=[0, 1.05]),
height=420,
)
fig.show()
Practical checklist#
Use Bartlett when normality is plausible and you care about power.
If you reject \(H_0\):
consider Welch’s ANOVA (for means with unequal variances)
use a variance-stabilizing transform (e.g., log for positive skew)
use robust tests (Levene / Brown–Forsythe) when normality is doubtful
If you don’t reject \(H_0\):
treat it as “no strong evidence of unequal variances”, not proof of equality
Tip: if you’re unsure about normality, you can skip the pre-test and directly use methods that are robust to unequal variances.
Exercises#
Show (algebraically) that Bartlett’s statistic is scale-invariant.
Modify bartlett_null_distribution to accept unequal group sizes like [10, 20, 50] and compare the null histogram shape.
Implement Levene’s test (median-centered Brown–Forsythe variant) with NumPy and compare its Type I error to Bartlett on skewed data.
References#
Bartlett, M. S. (1937). Properties of sufficiency and statistical tests. Proceedings of the Royal Society.
Any mathematical statistics text covering likelihood-ratio tests for normal models.
SciPy reference implementation: scipy.stats.bartlett (useful to compare against your NumPy implementation).